68 research outputs found

    Convolutional LSTM Networks for Subcellular Localization of Proteins

    Get PDF
    Machine learning is widely used to analyze biological sequence data. Non-sequential models such as SVMs or feed-forward neural networks are often used although they have no natural way of handling sequences of varying length. Recurrent neural networks such as the long short term memory (LSTM) model on the other hand are designed to handle sequences. In this study we demonstrate that LSTM networks predict the subcellular location of proteins given only the protein sequence with high accuracy (0.902) outperforming current state of the art algorithms. We further improve the performance by introducing convolutional filters and experiment with an attention mechanism which lets the LSTM focus on specific parts of the protein. Lastly we introduce new visualizations of both the convolutional filters and the attention mechanisms and show how they can be used to extract biological relevant knowledge from the LSTM networks

    Integrating sequence and structural biology with DAS.

    Get PDF
    BACKGROUND: The Distributed Annotation System (DAS) is a network protocol for exchanging biological data. It is frequently used to share annotations of genomes and protein sequence. RESULTS: Here we present several extensions to the current DAS 1.5 protocol. These provide new commands to share alignments, three dimensional molecular structure data, add the possibility for registration and discovery of DAS servers, and provide a convention how to provide different types of data plots. We present examples of web sites and applications that use the new extensions. We operate a public registry of DAS sources, which now includes entries for more than 250 distinct sources. CONCLUSION: Our DAS extensions are essential for the management of the growing number of services and exchange of diverse biological data sets. In addition the extensions allow new types of applications to be developed and scientific questions to be addressed. The registry of DAS sources is available at http://www.dasregistry.org.RIGHTS : This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may : copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are

    Physicochemical Properties, Cytotoxicity, and Antioxidative Activity of Natural Deep Eutectic Solvents Containing Organic Acid

    Get PDF
    Natural deep eutectic solvents (NADES) may be considered ‘designer solvents’ due to their numerous structural variations and the possibility of tailoring their physicochemical properties. Prior to their industrial application, characterization of NADES is essential, including determination of their physicochemical properties, cytotoxicity, and antioxidative activity. The most important physicochemical properties of eight prepared NADES (choline chloride:malic acid, proline:malic acid, choline chloride:proline:malic acid, betaine:malic acid, malic acid:glucose, malic acid:glucose:glycerol, choline chloride:citric acid, and betaine:citric acid) were measured as functions of temperature and water content. In general, the structure of prepared NADES greatly influences their physical properties, which could be successfully modified and adjusted by addition of water. All tested NADES were absolutely benign and noncorrosive for investigated steel X6CrNiTi18-10. Furthermore, cytotoxicity of prepared solvents was assessed toward three human cell lines (HEK-293T, HeLa, and MCF-7 cells), and antioxidative activity was measured by the Oxygen Radical Absorbance Capacity (ORAC) method. With regard to cell viability, all tested NADES containing carboxylic acid could be classified as practically harmless and considered environmentally safe. The ORAC values indicated that the tested NADES displayed antioxidative activity

    PROSITE, a protein domain database for functional characterization and annotation

    Get PDF
    PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. It is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of these profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. PROSITE is largely used for the annotation of domain features of UniProtKB/Swiss-Prot entries. Among the 983 (DNA-binding) domains, repeats and zinc fingers present in Swiss-Prot (release 57.8 of 22 September 2009), 696 (∼70%) are annotated with PROSITE descriptors using information from ProRule. In order to allow better functional characterization of domains, PROSITE developments focus on subfamily specific profiles and a new profile building method giving more weight to functionally important residues. Here, we describe AMSA, an annotated multiple sequence alignment format used to build a new generation of generalized profiles, the migration of ScanProsite to Vital-IT, a cluster of 633 CPUs, and the adoption of the Distributed Annotation System (DAS) to facilitate PROSITE data integration and interchange with other sources. The latest version of PROSITE (release 20.54, of 22 September 2009) contains 1308 patterns, 863 profiles and 869 ProRules. PROSITE is accessible at: http://www.expasy.org/prosite/

    Chaste: an open source C++ library for computational physiology and biology

    Get PDF
    Chaste - Cancer, Heart And Soft Tissue Environment - is an open source C++ library for the computational simulation of mathematical models developed for physiology and biology. Code development has been driven by two initial applications: cardiac electrophysiology and cancer development. A large number of cardiac electrophysiology studies have been enabled and performed, including high performance computational investigations of defibrillation on realistic human cardiac geometries. New models for the initiation and growth of tumours have been developed. In particular, cell-based simulations have provided novel insight into the role of stem cells in the colorectal crypt. Chaste is constantly evolving and is now being applied to a far wider range of problems. The code provides modules for handling common scientific computing components, such as meshes and solvers for ordinary and partial differential equations (ODEs/PDEs). Re-use of these components avoids the need for researchers to "re-invent the wheel" with each new project, accelerating the rate of progress in new applications. Chaste is developed using industrially-derived techniques, in particular test-driven development, to ensure code quality, re-use and reliability. In this article we provide examples that illustrate the types of problems Chaste can be used to solve, which can be run on a desktop computer. We highlight some scientific studies that have used or are using Chaste, and the insights they have provided. The source code, both for specific releases and the development version, is available to download under an open source Berkeley Software Distribution (BSD) licence at http://www.cs.ox.ac.uk/chaste, together with details of a mailing list and links to documentation and tutorials

    Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment

    Get PDF
    Motivation: Many proteins with vastly dissimilar sequences are found to share a common fold, as evidenced in the wealth of structures now available in the Protein Data Bank. One idea that has found success in various applications is the concept of a reduced amino acid alphabet, wherein similar amino acids are clustered together. Given the structural similarity exhibited by many apparently dissimilar sequences, we undertook this study looking for improvements in fold recognition by comparing protein sequences written in a reduced alphabet. Results: We tested over 150 of the amino acid clustering schemes proposed in the literature with all-versus-all pairwise sequence alignments of sequences in the Distance matrix alignment (DALI) database. We combined several metrics from information retrieval popular in the literature: mean precision, area under the Receiver Operating Characteristic curve and recall at a fixed error rate and found that, in contrast to previous work, reduced alphabets in many cases outperform full alphabets. We find that reduced alphabets can perform at a level comparable to full alphabets in correct pairwise alignment of sequences and can show increased sensitivity to pairs of sequences with structural similarity but low-sequence identity. Based on these results, we hypothesize that reduced alphabets may also show performance gains with more sophisticated methods such as profile and pattern searches

    Podbat: A Novel Genomic Tool Reveals Swr1-Independent H2A.Z Incorporation at Gene Coding Sequences through Epigenetic Meta-Analysis

    Get PDF
    Epigenetic regulation consists of a multitude of different modifications that determine active and inactive states of chromatin. Conditions such as cell differentiation or exposure to environmental stress require concerted changes in gene expression. To interpret epigenomics data, a spectrum of different interconnected datasets is needed, ranging from the genome sequence and positions of histones, together with their modifications and variants, to the transcriptional output of genomic regions. Here we present a tool, Podbat (Positioning database and analysis tool), that incorporates data from various sources and allows detailed dissection of the entire range of chromatin modifications simultaneously. Podbat can be used to analyze, visualize, store and share epigenomics data. Among other functions, Podbat allows data-driven determination of genome regions of differential protein occupancy or RNA expression using Hidden Markov Models. Comparisons between datasets are facilitated to enable the study of the comprehensive chromatin modification system simultaneously, irrespective of data-generating technique. Any organism with a sequenced genome can be accommodated. We exemplify the power of Podbat by reanalyzing all to-date published genome-wide data for the histone variant H2A.Z in fission yeast together with other histone marks and also phenotypic response data from several sources. This meta-analysis led to the unexpected finding of H2A.Z incorporation in the coding regions of genes encoding proteins involved in the regulation of meiosis and genotoxic stress responses. This incorporation was partly independent of the H2A.Z-incorporating remodeller Swr1. We verified an Swr1-independent role for H2A.Z following genotoxic stress in vivo. Podbat is open source software freely downloadable from www.podbat.org, distributed under the GNU LGPL license. User manuals, test data and instructions are available at the website, as well as a repository for third party–developed plug-in modules. Podbat requires Java version 1.6 or higher

    Circular Permutation in Proteins

    Get PDF
    This is a ‘‘Topic Page’ ’ article for PLoS Computational Biology. Circular permutation describes a type of relationship between proteins, whereby the proteins have a changed order of amino acids in their protein sequence, such that the sequence of the first portion of one protein (adjacent to the N-terminus) is related to that of the second portion of the other protein (near its C-terminus), and vice versa (see Figure 1). This is directly analogous to the mathematical notion of a cyclic permutation over the set of residues in a protein. Circular permutation can be the result of evolutionary events, post-translational modifications, or artificially engineered mutations. The result is a protein structure with different connectivity, but overall similar three-dimensional (3D) shape. The homology between portions of the proteins can be established by observing similar sequences between N- and C-terminal portions of the tw
    corecore